Eatomics is a R-Shiny based web application that enables interactive exploration of quantitative proteomics data generated by MaxQuant software. The application contains a suite of both novel and established bioinformatics tools (mainly as R functions), wrapped into four major tabpanels, performing:
Featuring quick and easy interpretation of the MaxQuant data even for researchers with limited programming background. Additionally, the interactive clinical feature selection developed, enables the user to explore the differentially expressed protein amongst the clinical parameter in one-click.
A recent version of Rstudio
As our Eatomics shiny application integrates several data analysis tools using R, a series of R packages are run on the background as mentioned in ´Details/Imports´: in the Package Description file.
The app requires two file inputs:
The name of the samples shall be identical in both input files. Also, represent character variables as ‘factor’ and integers as ‘numeric’. One limitation is, ordinal values are treated as numeric. text
To see an example on the input files, please refer example files
Overview of the tab panels
The tab enables the upload and preparation of both input files. The tab uses set of novel R functions (in italics) to filter and visualize the quality of the data. Additionally, also adapts a series of existing data filtering functions from several other R packages such as DEP. The tabpanel provides an overview on the data quality and enables the user to filter and prepare data for downstream expression analysis.
The following subsections explain the tab in detail.
To begin the analysis the user has to upload the MaxQuant file (e.g.proteinGroups.txt), which can be done in two ways:
The clinical data input file (e.g clinicaldata.txt) is also uploaded in the similar way as explained above.
The following section explains the details of the widgets used in the tab.
The impute widget uses the impute {DEP} function to impute missing values in a proteomics dataset using multiple imputation methods
The exclude column widget allows the user to exclude samples, especially if any outliers found after conducting initial quality analysis such as PCA, thus avoiding the sample from the consecutive steps in differential analysis.
Screenshot of the widgets
Our Eatomics app wraps several functions to visualize the quality control of the uploaded proteinGroups.txt file.
Screenshot on the visualization plots developed anew are only provided in the documentation. Others, have been majorly adapted from DEP package. text
A common method of dimensionality reduction is principal component analysis (PCA). Inherently, PCA calculates axes of most variation (principal components) within the expression data. A common assumption is that a plot along the axes of most variation will segregate all samples/patients into groups under investigation.
Fig.1.Screenshot of the PCA plot widget
Gene lists for the families GATA, JUN, SH2, SOX and TBX were downloaded from https://www.genenames.org on 23.03.2017. Please note, that in fact not all members of these families are necessarily low-abundant.
Fig.2.Screenshot of the Low abundance transcription factors plot widget
Protein numbers describes the count of distinct proteins or isoforms per sample. The plot is generated by the plot_numbers() function from the DEP package which was adjusted to work without experimental design information.
Sample coverage describes how many proteins are shared across samples. I.e. there will be proteins that were detected in only one sample (top of the plot) or proteins detected in all samples (bottom of the plot). The plot is generated by the plot_coverage() function from the DEP package which was adjusted to work without experimental design information.
The missing values heatmap shows the distribution of missing values across samples. White areas mark missing values, while black is a valid value. If samples cluster based on their pattern of missing values, caution should be taken in further analysis as this pattern may perturb, e.g., differential expression analysis. In well-founded cases, samples should be removed from the analysis. The plot is generated by the plot_missval() function from the DEP package.
The missing values plot visualizes the density and cumulative protein intensities with and without missing values. It can guide the selection of the missing value imputation strategy. The plot is generated by the plot_impute() function from the DEP package which was adjusted to work without experimental design information.
The missing value imputation plot shows the density of missing and valid values over range of the spectrum of intensities before and after imputation with the selected method. The plot is generated by the plot_detect() function from the DEP package which was adjusted to work without experimental design information.
The sample-to-sample heatmap describes the biological and/or technical variability of the samples. It can be produced using different (dis-) similarity metrics (Pearson correlation or euclidean distance). For this plot ´r params$StSheatmapDistMetric´ was used. Formed clusters should resemble the sample groups under investigation.
Fig.3.Screenshot of Sample to Sample Heatmap
Protein intensities are cumulated across all samples and plotted according to their relative abundance. Colouring marks the respective quantile of the proteins. Highly abundant proteins, i.e., proteins ranked in the first quartile are colored in red and labels are specified. The top 20 ranked proteins and their cumulated intensity are given in the table to the right.
Fig.4.Screenshot of Cumulative Protein Intensities
limma (linear models for microarray data), is a commonly used R/Bioconductor software package for analyzing microarray and RNA-seq data. Limma functions fit statistical linear models and conducts differential expression analysis.
Eatomics uses limma to perform real time analysis of differentially expressed proteins amongst clinical parameters of choice. The resulting interactive visualization plot including volcanoplots (detailed below) allows a quick and detailed overview on the differetial expression analysis.
The clinical group widget allows the user to select the clinical parameter of their choice from the uploaded dataset, to perform limma analysis.
The compare widget shows all the available subgroups inside the selected clinical group. The user then selects two subgroups to compare and find the differential expressed genes in the selected subgroups.
The surrogate variable widget uses the sva{sva} function to remove the unknown source of variation such as batch effects from the expression value of the compared clinical subgroup. It is better to impute before, checking surrogate variable, as the sva function works better with non-missing values
The impute values provides the user the flexibility to choose the imputed MaxQuant expression values (from the Load and Prep tab)
The apply filters widget allows to extend filter to a second clinical parameter, where the user can narrow down the filtering to very specific subgroup of another clinical parameter of interest
Using threshold box widget, the user can interactively adjust for multiple testing corrected (adjusted) P-value and log FoldChange(logFC) , plotting the genes above the chosen threshold in the volcano plot
lmFit{limma} along with other liner model functions from limma performs the differential expression analysis on the selected clinical group. Later, the resulting table to top-ranked genes{limma} the volcanoplot is plotted. In the plot, the x axis plots the log-2-fold change value (`lfc´ from the table) where logFC> 0 means the genes are up-regulated and logFC<0 meaning genes down regulated in the clinical group selected first (i.e,cases in Fig.5.) and the y axis plots the -log10 of P value (´p.value´) from the gene list.
Fig.5.VolcanoPlot
The list of the genes up and down regulated (once that crosssed the given threshold) and their summary including the gene name, log-2-foldchange, P value and adjusted P value (adj.P.Val) is represented in the respective up and down regulated tables.
Fig.6.Screenshot of Up and Down regulated gene list
The collapsible panel contains details on the input paramaters that were selected by the user through the limma widgets.
Fig.7.Screenshot of Detailed Description collapsible tab panel
The tab provides an easy to use user-interface on the established (ssGSEA) (single-sample Gene Set Enrichment Analysis) method which is an extension of GSEA. The interactive widgets in the tab are developed based on the freely accessed codes developed by Broad institute 1. The tab currently focuses only on four MSigDB(version v6.2) genesets namely, H- hallmark geneset, C2- KEGG geneset, C5- GO geneset and C1- positional geneset to calculate the enrich ment score and the user selects one geneset from these (similar to conventional GSEA).
Later, the resulting output,the .gct files with the respective geneset enrichment score will be auto-downloaded into the working directory, into the folder called EnrichmentScore.
1: Krug, K., et al., A Curated Resource for Phosphosite-specific Signature Analysis. Mol Cell Proteomics, 2019. 18(3): p. 576-593.
A message alert will be pop-up on download completion.
Screenshot of the ssGSEA widgets
Fig.8.Screenshot of ssGSEA tab panel
Differential expression analysis of the pathways enriched via ssGSEA (using the previou tabpanel) in the clinical group of interest,is facilitated by the tab. The pathway enrichment scores geneset is tested for differential pathway expression using t.test. The test compares the enriched pathways differential expressed among two clinical conditions selected.
EnrichmentScore folder, provided the user has saved the file using ssGSEA tab. The score file is .gct file, a m x n matrix where m rows contain the name of the pathway from the selected geneset and n number of column represents the sample names matching to the clinical data.
Make sure the sample names (column names) format in the geneset collection matches to the sample names (PatientID) in the clinical data file. text
*Rest of the widgets in the panel follows the same pattern in the ‘differential limma’ tab
The interactive volcanoplot plots the log ratio (x-axis) vs -log10 of P value calsulated based of t.test. The bonferroni adjusted P value (adj.p< 0.05) is selected as the threshold to highlight the differentially expressed pathways.
Fig.9.Screenshot of VolcanoPlot
The other two result formats including interactive data table and the detailed description panel lists the highlighted up and and down regulated pathways and details on the input parameters, respecively.
Shiny R package
R
|app.R
|helpers.R
|run_app_function.R
vignettes
|About.rmd
|
man
|helpers_roxygen.R
|
markdown_docs
|Tutorial.rmd
|About.rmd
Input Files
|proteinGroups.txt
|clindata.txt
Datasets
MSigDB
|c1.all.v6.1.symbols.gmt
|c2.cp.kegg.v6.1.symbols.gmt
|c5.all.v6.1.symbols.gmt
|h.all.v6.1.symbols.gmt
TranscriptionFactors
|GATA-familyHuman230318.txt
|JUN-familyHuman230318.txt
|SH2-familyHuman230318.txt
|SOX-familyHuman230318.txt
|TBX-familyHuman230318.txt
EnrichmentScore
Report
|
|DESCRIPTION
|
|NAMESPACE
|
|README.md
|